---
title: "Intro to ggplot2"
output: html_notebook

---
Let's learn to make plots in R! While there are some simple plotting functions built into base R (you will often see tutorials that use the plot() command), I encourage you to produce your plots and data visualizations using the ggplot2 package in R. This package takes a little getting used to, but once you understand the syntax you will be making effective graphs and visualizations in no time! Visualizing data is such an important part of the data analysis process: it helps us to better understand the data and its distribution, it allows us to identify and communicate patterns in simple and visually appealing ways, and it enables us to condense a large amount of technical information into a diagram or visual. 


```{r}

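#Make sure you have installed ggplot2 first (install.packages("ggplot2"))! Let's load the packages and set the working directory.
#Change the path below to the folder on your own computer that holds the data files.
setwd("/Users/aidenstanton/projects/R")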
library(dplyr)
library(tidyr)
library(ggplot2)

#we'll start by loading in some data to play with! We'll use NYC temperature data for this tutorial. 

temps <- read.csv("temps_nyc.csv")

#Take a look at this dataset! It contains mean, min, and max temperatures in NYC for an entire year (2014).
#What if we wanted to plot the temperatures over time? We could plot it using base R like so:

plot(temps$day, temps$actual_mean_temp)

#For most plots, the arguments come in the order (x = , y = ) - we'll put time (days) on the horizontal axis and temperatures on the vertical axis. Putting the time variable on the x-axis is pretty standard: x is for independent variables and y is for dependent variables.
```
This plot isn't bad, but it isn't very nice looking either. The ggplot2 package gives us so much flexibility to customize our plots - we'll make a much nicer version of this soon. Before we get to that, we first need to learn a bit about the syntax of ggplot2. 

```{r}
#Let's look at the first line of a basic ggplot graph:

ggplot(temps, aes(x = day, y = actual_mean_temp))

#When you use the ggplot() command, you need to supply a few key arguments. The first is the dataset - in this case, we will be using the temps data (as shown). The next part, called the aesthetic mapping or aes of the plot, tells us what we will be plotting from the dataset. Later, we will also include some characteristics of the plot in the aes() section. Can we plot the graph now? Not just yet! We need to add a geom layer - the geom layer tells ggplot2 what kind of visualization to produce with the data. We use a + sign to indicate a new layer in the plot like this (here I'm using geom_point to tell ggplot2 to draw a scatter plot):

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point()

#General pattern: ggplot(data frame, aes(x = independent variable, y = dependent variable)) + geom_<type of graph>()
```
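To see the layering idea in isolation, here is a small sketch that builds a plot one layer at a time. It uses a made-up data frame (`demo_points`, with hypothetical `day` and `temp` columns) instead of the temps data, so it should run without any CSV files.

```{r}
library(ggplot2)

# Hypothetical toy data, standing in for the temps data frame
demo_points <- data.frame(day = 1:5, temp = c(30, 34, 29, 41, 38))

# Step 1: data and aesthetic mapping only - this draws empty axes
p_axes <- ggplot(demo_points, aes(x = day, y = temp))

# Step 2: add a geom layer to draw the points
p_scatter <- p_axes + geom_point()

# Step 3: layers stack, so a second geom can be drawn on top of the first
p_scatter_line <- p_scatter + geom_line()

p_scatter_line
```

Storing each step in its own object is optional; it just makes it easier to print the plot at any stage and see what each layer contributes.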
```{r}
#One of the nice things about ggplot2 is its flexibility. We can easily customize the plot. Once you get used to the syntax of ggplot2, customization is very simple. For example, let's start by changing the color of the points:

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = "mediumpurple")


#General pattern: ggplot(data frame, aes(x = independent variable, y = dependent variable)) + geom_<type of graph>(color = "selected color")
```
There are a few important things to point out about the code above. First, I have put the color = argument in the geom_point layer. This tells R to use the color mediumpurple for the points - when we create more complex graphs, being able to customize each geom layer individually becomes really important. Second, the color that I chose is written in quotation marks. What happens if we leave them out? Without quotation marks, R reads the word as the name of an object, so the code below only works because we first create an object called blue that stores the color name.

```{r}
blue <- "blue"

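#Here we've created an object called blue that stores the color name "blue". Because the object exists, we can use it without quotation marks. Without it, R would stop with an "object 'blue' not found" error.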

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = blue)

```
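The difference between a quoted color and an unquoted name can be checked directly. The sketch below uses a made-up data frame (`demo_points`) and a deliberately undefined name (`undefined_color_object`, purely hypothetical) and captures the resulting error message instead of letting it stop the notebook.

```{r}
library(ggplot2)

# Hypothetical toy data so the example runs on its own
demo_points <- data.frame(day = 1:5, temp = c(30, 34, 29, 41, 38))

# Quoted: "blue" is a character string naming a color, so this works
p_quoted <- ggplot(demo_points, aes(x = day, y = temp)) +
  geom_point(color = "blue")

# Unquoted: R looks for an object with that name. The name below is made up
# and has never been defined, so R raises an error, which tryCatch() captures
# and returns as a character string.
unquoted_result <- tryCatch(
  ggplot(demo_points, aes(x = day, y = temp)) +
    geom_point(color = undefined_color_object),
  error = function(e) conditionMessage(e)
)

unquoted_result
```

The captured message names the missing object, which is a useful clue whenever a plot fails because a color was typed without quotation marks.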


```{r}
#What if we want to add labels to our plot? This is very easy to do with the labs() function, like so (remember that the x-axis is the horizontal axis, while the y-axis is the vertical axis):

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = "blue") +
  labs(y = "Mean Temperature", x = "Day")

#General pattern: ggplot(data frame, aes(x = independent variable, y = dependent variable)) + geom_<type of graph>(color = "selected color") + labs(y = "label for the dependent variable", x = "label for the independent variable")

```
```{r}
#This is starting to look pretty nice! What if we wanted to add a title too?

ggplot(temps, aes(x = day, y = actual_mean_temp)) +
  geom_point(color = "blue") +
  labs(y = "Mean Temperature", x = "Day", 
       title = "Mean Daily Temperature in New York City, 2014")

#You'll notice that I like to put each new argument after a "+" on a new line - you don't have to do this, but I prefer to because it makes my code much easier to follow. I also like to put longer label names on a new line - again, this won't affect how the code runs, it just makes it more readable. 

#General pattern: ggplot(data frame, aes(x = independent variable, y = dependent variable)) + geom_<type of graph>(color = "selected color") + labs(y = "label for the dependent variable", x = "label for the independent variable", title = "title of graph")

```
```{r}
#But what if we also wanted to graph the minimum and maximum temperatures on the same plot? This is also very easy to do! We just need to use a geom_point layer for each variable we want to plot.

#here we start by telling R that we want to use the temps data for our plot
ggplot(temps) +
  #for each new geom_point layer, I need to include a new aesthetic mapping
  #this tells R which variable to use in the plot
  geom_point(aes(x = day, y = actual_mean_temp), color = "gray") +
  geom_point(aes(x = day, y = actual_min_temp), color = "blue") +
  geom_point(aes(x = day, y = actual_max_temp), color = "red") +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014")

#Notice here that the color arguments are outside of the aes() argument. This is intentional - only arguments that depend on variables in the dataset should be in the aes() argument. What does that mean? In this case, the color "gray" doesn't depend on anything in the data - for example, the color doesn't change for lower or higher values. The entire geom_point layer is just gray. If we had a variable called "color" in the dataset, or if we wanted the colors to change based on temperature values, we could put color inside the aes(). 

```
```{r}
ggplot(temps) +
  geom_point(aes(x = day, y = actual_mean_temp, color = actual_mean_temp)) +
  labs(y = "Mean Temperature", x = "Day", 
       title = "Mean Daily Temperature in New York City, 2014",
       color = "Mean Temperature (F)")

#This is what a color change based on variable values looks like.
```

Hopefully that makes sense now! So, the graph of multiple variables looks pretty nice! But there's no legend on our graph! How will people know what each color represents? ggplot2 only draws a legend for aesthetics that are mapped inside aes(), so colors set directly in a geom layer never get one. It's a situation that you'll come across somewhat frequently.

```{r}

#We'll manually set the colors in the legend using scale_color_manual. ggplot2 generates a legend only for aesthetics that are mapped inside aes(); if you set each color directly in the geom_point layer, outside aes(), it won't. So here I've put color = "" inside the aesthetic mapping of each geom layer, and the text I supply becomes that layer's label in the legend. scale_color_manual then tells R which actual color to use for each label.

ggplot(temps) +
  geom_point(aes(x = day, y = actual_mean_temp, color = "Mean")) +
  geom_point(aes(x = day, y = actual_min_temp, color = "Min")) +
  geom_point(aes(x = day, y = actual_max_temp, color = "Max")) +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       #since the legend is based on the color mapping, use color = to set the legend title
       color = "Temperature Values")+
  scale_color_manual(labels = c("Max", "Mean", "Min"), values = c("red", "gray", "blue"))

#In scale_color_manual, we start by telling R which labels to use to generate the color scheme; in this case, it's the same labels we just set above. Then, we have to tell R which colors to use for each label. Because there are three color values to set, note that we have to use c() around the list of variable names and colors. 

#scale_color_manual often involves some guessing and checking with the order of the colors - R used the first color for the max temperature, the second for the mean, and the third for the min. That is because ggplot2 sorts the labels alphabetically (Max, Mean, Min), regardless of the order in which you specify each layer in a plot. If you notice that the colors in your graph don't match up, the easiest fix is to change the order in which you list the colors and labels so that it matches this default ordering. That's what I did below.
```
```{r}
#Here's the correct plot!
ggplot(temps) +
  geom_point(aes(x = day, y = actual_mean_temp, color = "Mean")) +
  geom_point(aes(x = day, y = actual_min_temp, color = "Min")) +
  geom_point(aes(x = day, y = actual_max_temp, color = "Max")) +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature Values")+
  scale_color_manual(labels = c("Max", "Mean", "Min"), values = c("red", "gray", "blue"))
```
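One way to avoid the guess-and-check step is to give scale_color_manual a named vector, so each color is matched to its label by name rather than by position. This is a minimal sketch rather than the approach used above; the data frame `demo_temps`, its column names, and the `temp_colors` object are all made up for illustration.

```{r}
library(ggplot2)

# Hypothetical wide-format toy data (column names are assumptions)
demo_temps <- data.frame(
  day       = 1:4,
  mean_temp = c(33, 35, 30, 38),
  min_temp  = c(28, 30, 24, 31),
  max_temp  = c(38, 41, 36, 45)
)

# Names are the legend labels used inside aes(); values are the colors.
# Because matching is done by name, the order of this vector does not matter.
temp_colors <- c(Min = "blue", Max = "red", Mean = "gray")

p_named <- ggplot(demo_temps) +
  geom_point(aes(x = day, y = mean_temp, color = "Mean")) +
  geom_point(aes(x = day, y = min_temp, color = "Min")) +
  geom_point(aes(x = day, y = max_temp, color = "Max")) +
  labs(y = "Temperature", x = "Day", color = "Temperature Values") +
  scale_color_manual(values = temp_colors)

p_named
```

The legend is still listed alphabetically, but each label keeps its intended color even if the vector is reordered.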

Why is it so hard to work with this data? This data is in a format called "wide format", where each category (mean, min, and max) is a different column/variable in the data. ggplot2 works best with data in what is called "long format", where the categories are all neatly gathered in one column. Showing you how to convert from wide to long format is tricky and outside the scope of this tutorial, but I'll load in a "long format" version of the data to show you the difference.

```{r}
#load the long data 

long_data <- read.csv("long_temperature.csv")

head(long_data)
```


Do you see the difference? Now, the category (temp_type, or mean, min, and max) is in one column, while each temperature that corresponds to the temp_type and day is in the temp column. The wide and long data sets are just different ways of storing the same data! Now let's see how this works in ggplot2. 
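For the curious, the reshaping step itself can be sketched with pivot_longer() from the tidyr package. This is not how long_temperature.csv was necessarily produced; the toy data frame `demo_wide` and the relabelling of the categories to "mean", "min", and "max" are assumptions for illustration.

```{r}
library(tidyr)

# Hypothetical wide-format toy data: one column per temperature type
demo_wide <- data.frame(
  day              = 1:2,
  actual_mean_temp = c(33, 35),
  actual_min_temp  = c(28, 30),
  actual_max_temp  = c(38, 40)
)

# Gather the three temperature columns into two new columns:
# temp_type holds the old column name, temp holds the value
demo_long <- pivot_longer(
  demo_wide,
  cols      = c(actual_mean_temp, actual_min_temp, actual_max_temp),
  names_to  = "temp_type",
  values_to = "temp"
)

# Shorten the category labels, e.g. "actual_mean_temp" becomes "mean"
demo_long$temp_type <- sub("^actual_(.*)_temp$", "\\1", demo_long$temp_type)

demo_long
```

Each day now takes up three rows, one per temperature type, which is the same shape as the long data loaded above.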

```{r}

#Now, instead of three geom layers, we need just one, with the data split into three groups: mean, min, and max. Because each type of temperature should have a different color, we assign the temp_type column to the color argument inside aes(). Mapping color to a categorical column also groups the data by that column, so we don't need a separate group argument here. Let's see what this looks like!

ggplot(long_data, aes(x = Day, y = temp, color = temp_type)) +
  geom_point() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")
```

Do you see why I changed the column names? R uses the categories in the temp_type column to add names to the legend. Keeping the "actual_mean_temp" (and so on) labels would not have been nearly as clear in a legend. In our graphs, we should aim to show complex information in the simplest way possible - having clear legend and axis titles is key to that. Now, in this case, the colors aren't quite right! Let's set them manually. 

```{r}
#create the color palette for the data
#Remember, order matters! Based on the order of the legend in the last graph, I will include the color for the max temp, then the mean, then the min. 

colors <- c("red", "gray", "blue")

ggplot(long_data, aes(x = Day, y = temp, color = temp_type)) +
  geom_point() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = colors)
```

Do you see how much simpler and shorter the ggplot2 code is now? Long format data is easier to work with, but I wanted to show you both data types so that you are prepared for any data format that you might encounter. 

If you want to see what other colors are available for creating custom color palettes, check this link: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf. 

So now you know the basics of graphing with ggplot2! There are just a few more topics to cover that you will find helpful. First, what if I don't want to use a scatter plot? The ggplot2 package comes with a wide range of geom possibilities, so it's easy to produce different kinds of plots of your data. Let's make a line plot with the data we already have.

```{r}

ggplot(long_data, aes(x = Day, y = temp, color = temp_type)) +
  geom_line() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = colors)

```
The line plot doesn't look nice and smooth because we're working with daily data - there are a lot of data points, and the temperatures move around a lot! But as you can see, switching to a line plot was so easy. Next, we can look at a bar plot. Line plots, scatter plots, and bar plots will be the most common plots you'll use. 

```{r}
#Remember that this kind of bar plot doesn't need two numeric variables - we just need one numeric y variable (in this case, temperature) and categories for the x-axis (in this case, temp_type). Let's make a simple bar plot for one day:

day1 <- long_data %>% filter(Day == 1)

ggplot(day1, aes(x = temp_type, y = temp, fill = temp_type)) +
  geom_bar(stat = "identity") +
  labs(x = "Temperature Type", y = "Temperature",
       title = "Temperatures in One Day in NYC") +
  scale_fill_manual(values = colors) +
  theme(legend.position = "none")

#Some things to note about this code: because I have supplied the bar heights myself as y-values, I need to include the argument stat = "identity" in the geom_bar layer (by default, geom_bar counts the rows in each category instead). Without going into too much detail, this argument tells R that the height of the columns should be equal to the temp values. Note also that instead of color, we use the fill = argument here - the color argument is for lines and points, while solid shapes such as bars need to be assigned colors using the fill argument. As an exercise, try seeing what happens when you use color = instead! When you use fill, scale_color_manual also changes to scale_fill_manual to set the color palette. Finally, R will automatically generate a legend when you map fill (or color) to a variable inside aes() - I didn't need a legend in this graph, so I used theme() to set the legend position to "none". This deletes the legend from the plot, and is worth remembering.
```
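As a side note, ggplot2 also provides geom_col(), which draws bars whose heights come straight from the y values, so no stat argument is needed. The sketch below compares the two on a made-up one-day data frame (`demo_day` and its values are assumptions for illustration).

```{r}
library(ggplot2)

# Hypothetical one-day toy data in long format
demo_day <- data.frame(
  temp_type = c("max", "mean", "min"),
  temp      = c(38, 33, 28)
)

# Bar heights taken from the data via stat = "identity"
p_bar <- ggplot(demo_day, aes(x = temp_type, y = temp, fill = temp_type)) +
  geom_bar(stat = "identity")

# The same plot written with geom_col()
p_col <- ggplot(demo_day, aes(x = temp_type, y = temp, fill = temp_type)) +
  geom_col()

p_col
```

Both versions should draw the same bars; which one to use is largely a matter of taste.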
Let's think about one more way you can customize your plots: adding a visual theme. There are a number of themes you can add to your plot - I like the minimal theme best.  

```{r}

#I prefer the minimal theme that comes loaded with ggplot2 - it makes plots look very sleek 
ggplot(long_data, aes(x = Day, y = temp, group = temp_type, color = temp_type)) +
  geom_line() +
  labs(y = "Temperature", x = "Day", 
       title = "Daily Temperature in New York City, 2014",
       color = "Temperature (F)")+
  scale_color_manual(values = colors) +
  theme_minimal()

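#ggthemes is an optional extra package with more themes - install it first with install.packages("ggthemes")
library(ggthemes)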

```

Other theme options include theme_dark(), theme_light(), and theme_gray(). If you want additional themes, the ggthemes package has great options.
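A built-in theme can also be adjusted after it is applied. The sketch below is one possible tweak, not a recommendation: it uses a made-up long-format data frame (`demo_lines`, an assumption standing in for long_data), enlarges the base font size, and moves the legend below the plot.

```{r}
library(ggplot2)

# Hypothetical long-format toy data, standing in for long_data
demo_lines <- data.frame(
  Day       = rep(1:4, times = 3),
  temp      = c(33, 35, 30, 38, 28, 30, 24, 31, 38, 41, 36, 45),
  temp_type = rep(c("mean", "min", "max"), each = 4)
)

p_theme <- ggplot(demo_lines, aes(x = Day, y = temp, color = temp_type)) +
  geom_line() +
  theme_minimal(base_size = 14) +     # complete theme with a larger base font
  theme(legend.position = "bottom")   # individual tweak, added afterwards

p_theme
```

The theme() call comes after theme_minimal() on purpose: a complete theme replaces earlier theme settings, so individual tweaks need to be added after it.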

And there you have it! You are now a pro at using ggplot2. You should have all the tools you need to make beautiful and effective visualizations in R. If you want more information on different types of graphs, or you just want a helpful reference to refer to as you progress through the course, you can find an excellent ggplot2 cheat sheet here: https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf.


Resources

FiveThirtyEight (2014). US. Weather History. [Data Set]. Retrieved from: https://github.com/fivethirtyeight/data/tree/master/us-weather-history. 

Prabhakaran, S. (2017). The Complete ggplot2 Tutorial - Part1 | Introduction To ggplot2. Retrieved from: http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html. 
